Accessing and analysing the OpenAIRE Research Graph data dumps

The OpenAIRE Research Graph provides a wide range of metadata about grant-supported research publications. This blog post presents an experimental R package with helpers for splitting, de-compressing and parsing the underlying data dumps. I will demonstrate how to use them by examining the compliance of funded projects with the open access mandate in Horizon 2020.

Najko Jahn https://twitter.com/najkoja (State and University Library Göttingen) https://www.sub.uni-goettingen.de/
03-30-2020

OpenAIRE has collected and interlinked scholarly data from various openly available sources for over ten years. In December 2019, this open science network released the OpenAIRE Research Graph (Manghi et al. 2019), a big scholarly data dump that contains metadata about more than 100 million research publications and 8 million datasets, as well as the relationships between them. These metadata are furthermore connected to open access locations and disambiguated information about persons, organisations and funders.

Like most big scholarly data dumps, the OpenAIRE Research Graph offers many data analytics opportunities, but working with it is challenging. One reason is the size of the dump. Although the OpenAIRE Research Graph is already split into several files, most of these files are too large to fit into the memory of a moderately equipped laptop. Another challenge is the format. The dump consists of compressed XML files following the comprehensive OpenAIRE data model, from which only certain elements may be needed for a data analysis.

In this blog post, I introduce the R package openairegraph, an experimental effort that helps to transform the large OpenAIRE Research Graph dumps into small, relevant data for a data analysis. These tools are aimed at data analysts and researchers alike who wish to conduct their own analysis using the OpenAIRE Research Graph, but are wary of handling its large data dumps. Focusing on grant-supported research results from the European Commission’s Horizon 2020 framework programme (H2020), I present how to subset and analyse the graph using openairegraph. My analytical use case is to benchmark the open access activities of grant-supported projects affiliated with the University of Göttingen against the overall uptake across the H2020 funding activities.

What is the R package openairegraph about?

So far, the R package openairegraph, which is available on GitHub, has two sets of functions. The first set provides helpers to split a large OpenAIRE Research Graph data dump into separate, de-coded XML records that can be stored individually. The other set consists of parsers that convert data from these XML files to a table-like representation following the tidyverse philosophy, a popular approach and toolset for doing data analysis with R (Wickham et al. 2019). Splitting, de-coding and parsing are essential steps before analysing the OpenAIRE Research Graph.

Installation

openairegraph can be installed from GitHub using the remotes (Hester et al. 2019) package:


library(remotes)
remotes::install_github("subugoe/openairegraph")

Loading a dump into R

Several dumps from the OpenAIRE Research Graph are available on Zenodo (Manghi et al. 2019). So far, I have tested openairegraph with the dump h2020_results.gz, which comprises research outputs funded by the European Commission’s Horizon 2020 funding programme (H2020).

After downloading it, the file can be imported into R using the jsonlite package (Ooms 2014). The following example shows that each line contains a record identifier and the corresponding Base64-encoded XML file. Base64 is a binary-to-text encoding, which allows a compressed file to be stored in a text-based format.


library(jsonlite) # tools to work with json files
library(tidyverse) # tools from the tidyverse useful for data analysis
oaire <- jsonlite::stream_in(file("data/h2020_results.gz"), verbose = FALSE) %>%
  tibble::as_tibble()
oaire
#> # A tibble: 92,218 x 2
#>    `_id`$`$oid`       body$`$binary`                          $`$type`
#>    <chr>              <chr>                                   <chr>   
#>  1 5dbc22f81e82127b5… UEsDBBQACAgIAIRiYU8AAAAAAAAAAAAAAAAEAA… 00      
#>  2 5dbc22f9b531c546e… UEsDBBQACAgIAIRiYU8AAAAAAAAAAAAAAAAEAA… 00      
#>  3 5dbc22fa45e3122d9… UEsDBBQACAgIAIViYU8AAAAAAAAAAAAAAAAEAA… 00      
#>  4 5dbc22fa45e3122d9… UEsDBBQACAgIAIViYU8AAAAAAAAAAAAAAAAEAA… 00      
#>  5 5dbc22fa4e0c061a4… UEsDBBQACAgIAIViYU8AAAAAAAAAAAAAAAAEAA… 00      
#>  6 5dbc22fb81f3c12c0… UEsDBBQACAgIAIViYU8AAAAAAAAAAAAAAAAEAA… 00      
#>  7 5dbc22fb895be1246… UEsDBBQACAgIAIViYU8AAAAAAAAAAAAAAAAEAA… 00      
#>  8 5dbc22fbe56570673… UEsDBBQACAgIAIViYU8AAAAAAAAAAAAAAAAEAA… 00      
#>  9 5dbc22fc81f3c12bf… UEsDBBQACAgIAIViYU8AAAAAAAAAAAAAAAAEAA… 00      
#> 10 5dbc22fcb531c546e… UEsDBBQACAgIAIZiYU8AAAAAAAAAAAAAAAAEAA… 00      
#> # … with 92,208 more rows
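To make the encoding step more concrete, here is a minimal sketch of a Base64 round trip in R. The XML snippet is made up, and in-memory gzip compression is used purely for illustration. The records in the dump appear to be ZIP archives instead, since their Base64 text starts with UEsDB, which decodes to the ZIP signature PK.

```r
library(jsonlite)

# A tiny stand-in for one XML record (hypothetical content)
xml_txt <- "<record><title>Example</title></record>"

# Compress the text in memory; gzip is an assumption made for this sketch
compressed <- memCompress(charToRaw(xml_txt), type = "gzip")

# Base64 turns the binary payload into plain text that fits into JSON
encoded <- jsonlite::base64_enc(compressed)

# Reverse both steps: decode the text to bytes, then decompress the bytes
decoded <- memDecompress(jsonlite::base64_dec(encoded), type = "gzip")
roundtrip <- rawToChar(decoded)
identical(roundtrip, xml_txt)
#> [1] TRUE

# The first bytes of the encoded bodies shown above are the ZIP signature
zip_signature <- rawToChar(jsonlite::base64_dec("UEsDBBQA")[1:2])
zip_signature
#> [1] "PK"
```

Note that Base64 itself does not compress anything: the encoded text is roughly a third larger than the bytes it represents.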

De-coding and storing OpenAIRE Research Graph records

The function openairegraph::oarg_decode() splits and de-codes each record. Storing the records individually makes it possible to process the files independently of each other, which is a common approach when working with big data.


library(openairegraph)
openairegraph::oarg_decode(oaire, records_path = "data/records/", 
  limit = 500, verbose = FALSE)

openairegraph::oarg_decode() writes out each XML-formatted record as a zip file to a specified folder. Because the dumps are quite large, the function furthermore has a parameter that allows setting a limit, which is helpful for inspecting the output first. By default, a progress bar presents the current state of the process.
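The general idea of splitting and de-coding can be sketched with a toy table. This is not the package's implementation, only an illustration of decoding each row and writing it to its own file. The helper decode_to_file(), the toy identifiers and the uncompressed XML payloads are assumptions made for this example.

```r
library(jsonlite)

# Toy stand-in for the dump: one id and one Base64-encoded payload per row.
# Real records are compressed archives; plain XML text keeps the sketch short.
toy_dump <- data.frame(
  id = c("rec1", "rec2", "rec3"),
  body = c(
    jsonlite::base64_enc(charToRaw("<record><title>A</title></record>")),
    jsonlite::base64_enc(charToRaw("<record><title>B</title></record>")),
    jsonlite::base64_enc(charToRaw("<record><title>C</title></record>"))
  ),
  stringsAsFactors = FALSE
)

# Hypothetical helper: decode one payload and store it as an individual file
decode_to_file <- function(id, body, dir) {
  path <- file.path(dir, paste0(id, ".xml"))
  writeBin(jsonlite::base64_dec(body), path)
  path
}

out_dir <- file.path(tempdir(), "toy_records")
dir.create(out_dir, showWarnings = FALSE)

# Process only the first `limit` rows, which is handy for a first inspection
limit <- 2
subset_dump <- head(toy_dump, limit)
paths <- mapply(decode_to_file, subset_dump$id, subset_dump$body,
                MoreArgs = list(dir = out_dir))

# Each record can now be read on its own
first_record <- readLines(paths[[1]], warn = FALSE)
```

Because every record ends up in a separate file, later steps can read and parse them one at a time or in parallel without loading the whole dump.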

Parsing OpenAIRE Research Graph records

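So far, there are four parsers available to consume the H2020 results set, including openairegraph::oarg_publications_md(), openairegraph::oarg_linked_projects() and openairegraph::oarg_linked_ftxt().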

These parsers can be used alone, or together like this:

First, I obtain the locations of the de-coded XML records.


openaire_records <- list.files("data/records", full.names = TRUE)

After that, I read each XML file using the xml2 (Wickham, Hester, and Ooms 2019) package, and apply three parsers: openairegraph::oarg_publications_md(), openairegraph::oarg_linked_projects() and openairegraph::oarg_linked_ftxt(). I use the future (Bengtsson 2020b) and future.apply (Bengtsson 2020a) packages to enable reading and parsing these records simultaneously with multiple R sessions. Running code in parallel reduces the execution time.


library(xml2) # working with xml files
library(future) # parallel computing
library(future.apply) # functional programming with parallel computing
library(tictoc) # timing functions

openaire_records <- list.files("data/records", full.names = TRUE)

future::plan(multisession)
tic()
oaire_data <- future.apply::future_lapply(openaire_records, function(files) {
  # load xml file
  doc <- xml2::read_xml(files)
  # parser
  out <- oarg_publications_md(doc)
  out$linked_projects <- list(oarg_linked_projects(doc))
  out$linked_ftxt <- list(oarg_linked_ftxt(doc))
  # use file path as id
  out$id <- files
  out
})
toc()
#> 44.751 sec elapsed
oaire_df <- dplyr::bind_rows(oaire_data)
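For readers curious about what such a parser does in principle, the following is a rough sketch of turning XML records into one table row each with xml2. The record structure, the element names and the function parse_toy_record() are hypothetical and much simpler than the OpenAIRE data model; the package's own parsers should be used for real records.

```r
library(xml2)
library(tibble)
library(dplyr)

# Hypothetical, heavily simplified records
records <- c(
  "<result><title>Paper A</title><bestaccessright>Open Access</bestaccessright></result>",
  "<result><title>Paper B</title><bestaccessright>Restricted</bestaccessright></result>"
)

# Toy parser: pick selected elements via XPath and return a one-row tibble
parse_toy_record <- function(xml_string) {
  doc <- xml2::read_xml(xml_string)
  tibble::tibble(
    title = xml2::xml_text(xml2::xml_find_first(doc, "//title")),
    best_access_right = xml2::xml_text(
      xml2::xml_find_first(doc, "//bestaccessright")
    )
  )
}

# Parse every record and bind the rows into one data frame
toy_df <- dplyr::bind_rows(lapply(records, parse_toy_record))
```

Since each call only depends on its own input, lapply() could be swapped for future.apply::future_lapply() to parse records in parallel, as in the code above.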

A note on performance: Parsing the whole dump h2020_results using these parsers took me around 2 hours. I therefore recommend backing up the resulting data instead of unpacking the whole dump for each analysis. jsonlite::stream_out() writes the data frame to a text-based JSON file, where list-columns are preserved per row.


jsonlite::stream_out(oaire_df, file("data/h2020_parsed_short.json"))
#> Processed 500 rows...
#> Complete! Processed total of 500 rows.
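The following small sketch illustrates why this format is convenient for nested data. It writes a toy data frame with a list-column to a temporary line-delimited JSON file and reads it back. The column names mimic the parsed data, but the values are invented.

```r
library(jsonlite)
library(tibble)

# Toy data frame with a list-column holding one small table per row
toy <- tibble::tibble(
  id = c("pub1", "pub2"),
  linked_projects = list(
    tibble::tibble(project_code = c("A1", "B2")),
    tibble::tibble(project_code = "C3")
  )
)

# Write one JSON document per row, then read the file back
tmp <- tempfile(fileext = ".json")
jsonlite::stream_out(toy, file(tmp), verbose = FALSE)
restored <- jsonlite::stream_in(file(tmp), verbose = FALSE)

# The nested project tables survive the round trip
restored$linked_projects[[1]]
```

Each line of the file is a self-contained JSON object, so the backup can also be streamed in chunks if it grows too large to read in one go.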

Use case: Monitoring the Open Access Compliance across H2020 grant-supported projects at the institutional level

Usually, it is not individual researchers who sign grant agreements with the European Commission (EC), but the institutions they are affiliated with. Universities and other research institutions administering EC-funded projects are therefore looking for ways to monitor compliance with funder rules. In the case of the open access mandate in Horizon 2020 (H2020), librarians are often assigned this task. Quantitative science studies have also started to address the efficacy of funders’ open access mandates (Larivière and Sugimoto 2018).

In this use case, I will illustrate how to make use of the OpenAIRE Research Graph, which links grants to publications and open access full-texts, to benchmark compliance with the open access mandate against other H2020 funding activities.

Overview

As a start, I load a dataset, which was compiled following the above-described methods using the whole h2020_results.gz dump.


oaire_df <- jsonlite::stream_in(file("data/h2020_parsed.json"), verbose = FALSE) %>%
  tibble::as_tibble()

It contains 92,218 grant-supported research outputs. Here, I will focus on the prevalence of open access across H2020 projects using metadata about the open access status of a publication and related project information stored in the list-column linked_projects.


pubs_projects <- oaire_df %>%
  filter(type == "publication") %>%
  select(id, type, best_access_right, linked_projects) %>%
  # transform to a regular data frame with a row for each project
  unnest(linked_projects) 

The dataset contains 84,781 literature publications from 9,008 H2020 projects. Which H2020 funding activity published most?


library(cowplot)
library(scales)
pubs_projects %>%
  filter(funding_level_0 == "H2020") %>% 
  mutate(funding_scheme = fct_infreq(funding_level_1)) %>%
  group_by(funding_scheme) %>%
  summarise(n = n_distinct(id)) %>%
  mutate(funding_fct = fct_other(funding_scheme, keep = levels(funding_scheme)[1:10])) %>%
  mutate(highlight = ifelse(funding_scheme %in% c("ERC", "RIA"), "yes", "no")) %>%
  ggplot(aes(reorder(funding_fct, n), n, fill = highlight)) +
  geom_bar(stat = "identity") +
  coord_flip() +
  scale_fill_manual(
    values = c("#B0B0B0D0", "#56B4E9D0"),
    name = NULL) +
  scale_y_continuous(
    labels = scales::number_format(big.mark = ","),
    expand = expansion(mult = c(0, 0.05)),
    breaks =  scales::extended_breaks()(0:25000)
    ) +
  labs(x = NULL, y = "Publications", caption = "Data: OpenAIRE Research Graph") +
  theme_minimal_vgrid(font_family = "Roboto") +
  theme(legend.position = "none")

Figure 1: Publication Output of Horizon 2020 funding activities captured by the OpenAIRE Research Graph, released in December 2019.

Figure 1 shows that most publications in the OpenAIRE Research Graph originate from the European Research Council (ERC), Research and Innovation Actions (RIA) and Marie Skłodowska-Curie Actions (MSCA). On average, a project published 10 articles. However, the publication performance per H2020 funding activity varies considerably (SD = 33).
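The post does not show how the average and the standard deviation were obtained. One plausible way, sketched here on invented data, is to count distinct publications per project first and then summarise these counts. The toy identifiers are assumptions.

```r
library(dplyr)

# Toy publication-project links; pub4 is linked to two projects
toy_links <- tibble::tibble(
  id = c("pub1", "pub2", "pub3", "pub4", "pub4"),
  project_code = c("A", "A", "A", "B", "C")
)

# Step 1: number of distinct publications per project
per_project <- toy_links %>%
  group_by(project_code) %>%
  summarise(n_pubs = n_distinct(id))

# Step 2: mean and standard deviation across projects
summary_stats <- per_project %>%
  summarise(mean_pubs = mean(n_pubs), sd_pubs = sd(n_pubs))
```

A standard deviation well above the mean, as reported above, usually points to a skewed distribution in which a few projects publish far more than the rest.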

The European Commission mandates open access to publications. Let’s measure the compliance with this policy using the OpenAIRE Research Graph per project:


library(rmarkdown)
oa_monitor_ec <- pubs_projects %>%
  filter(funding_level_0 == "H2020") %>%
  mutate(funding_scheme = fct_infreq(funding_level_1)) %>%
  group_by(funding_scheme,
           project_code,
           project_acronym,
           best_access_right) %>%
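  summarise(oa_n = n_distinct(id)) %>% # distinct publications per access status
  # summarise() drops the last grouping level, so sum(oa_n) is the project total
  mutate(oa_prop = oa_n / sum(oa_n)) %>%
  filter(best_access_right == "Open Access") %>%
  ungroup() %>%
  # round before converting, because as.integer() truncates floating-point results
  mutate(all_pub = as.integer(round(oa_n / oa_prop))) 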
rmarkdown::paged_table(oa_monitor_ec)
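The grouped calculation can be checked by hand on a small example. The sketch below uses invented publications for two projects and makes the peeling of the last grouping level explicit via the .groups argument, which assumes dplyr 1.0.0 or later.

```r
library(dplyr)

# Toy data: project A has 3 of 4 publications open access, project B 1 of 3
toy_pubs <- tibble::tibble(
  id = paste0("pub", 1:7),
  project_code = c(rep("A", 4), rep("B", 3)),
  best_access_right = c(
    "Open Access", "Open Access", "Open Access", "Restricted",
    "Open Access", "Embargo", "Restricted"
  )
)

oa_share <- toy_pubs %>%
  group_by(project_code, best_access_right) %>%
  # count per access status, keep the data grouped by project afterwards
  summarise(oa_n = n_distinct(id), .groups = "drop_last") %>%
  # within each project: share per access status and total publications
  mutate(oa_prop = oa_n / sum(oa_n), all_pub = sum(oa_n)) %>%
  ungroup() %>%
  filter(best_access_right == "Open Access")
```

Here project A ends up with an open access share of 0.75 and project B with one third. A project without any open access publication would have no "Open Access" row and therefore drop out of the result, which is worth keeping in mind when interpreting the shares.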

In the following, this aggregated data, oa_monitor_ec, will provide the basis to explore variations among and within H2020 funding programmes.


oa_monitor_ec %>%
  mutate(funding_fct = fct_other(funding_scheme, keep = levels(funding_scheme)[1:10])) %>%
  # only projects with at least five publications
  filter(all_pub >= 5) %>%
  ggplot(aes(fct_rev(funding_fct), oa_prop)) +
  geom_boxplot() +
  geom_hline(aes(
    yintercept = mean(oa_prop),
    color = paste0("Mean=", as.character(round(
      mean(oa_prop) * 100, 0
    )), "%")
  ),
  linetype = "dashed",
  size = 1) +
  geom_hline(aes(
    yintercept = median(oa_prop),
    color = paste0("Median=", as.character(round(
      median(oa_prop) * 100, 0
    )), "%")
  ),
  linetype = "dashed",
  size = 1) +
  scale_color_manual("H2020 OA Compliance", values = c("orange", "darkred")) +
  coord_flip() +
  scale_y_continuous(labels = scales::percent_format(accuracy = 5L),
                     expand = expansion(mult = c(0, 0.05))) +
  labs(x = NULL,
       y = "Open Access Percentage",
       caption = "Data: OpenAIRE Research Graph") +
  theme_minimal_vgrid(font_family = "Roboto") +
  theme(legend.position = "top",
        legend.justification = "right")

Figure 2: Open Access Compliance Rates of Horizon 2020 projects relative to funding activities, visualised as box plot. Only projects with at least five publications were considered.

About 77% of research publications under the H2020 open access mandate are openly available. Figure 2 highlights a generally high rate of compliance with the open access mandate. However, uptake levels vary among and within the H2020 funding programmes.

Because of their large variations, I want to put the open access rates of H2020-funded projects in context when presenting the share for projects affiliated with the University of Göttingen. Again, the data analysis starts with loading the previously backed up file with decoded and parsed data, choosing project and access information from it.


oaire_df <- jsonlite::stream_in(file("data/h2020_parsed.json"), verbose = FALSE) %>%
  tibble::as_tibble()

pubs_projects <- oaire_df %>%
  select(id, type, best_access_right, linked_projects) %>%
  unnest(linked_projects) 
pubs_projects
#> # A tibble: 136,298 x 12
#>    id    type  best_access_rig… to    project_title funder
#>    <chr> <chr> <chr>            <chr> <chr>         <chr> 
#>  1 data… publ… Open Access      proj… Planning and… Europ…
#>  2 data… publ… Open Access      proj… Cortical alg… Europ…
#>  3 data… publ… Open Access      proj… Human Brain … Europ…
#>  4 data… publ… Restricted       proj… Implementati… Europ…
#>  5 data… publ… Open Access      proj… The power of… Europ…
#>  6 data… publ… Open Access      proj… A psychologi… Wellc…
#>  7 data… publ… Open Access      proj… Effects of N… Europ…
#>  8 data… publ… Open Access      proj… Aggression s… Europ…
#>  9 data… publ… Open Access      proj… Global trend… Europ…
#> 10 data… publ… Open Access      proj… Mapping grav… Europ…
#> # … with 136,288 more rows, and 6 more variables:
#> #   funding_level_0 <chr>, funding_level_1 <chr>, project_code <chr>,
#> #   project_acronym <chr>, contract_type <chr>, funding_level_2 <chr>

Next, I want to identify H2020 projects with participation from the university. There are at least two ways to obtain links between projects and organisations. One is the OpenAIRE Research Graph, which provides project details from 29 funders in a separate dump, project.gz. Another option is to relate our dataset to open data provided by CORDIS, the European Commission’s research information portal. For convenience, I am going to follow the second option.


# load local copy downloaded from the EC open data portal
cordis_org <-
  readr::read_delim(
    "data/cordis-h2020organizations.csv",
    delim = ";",
    locale = locale(decimal_mark = ",")
  ) %>%
  # data cleaning
  mutate_if(is.double, as.character) 

After loading the file, I am able to tag projects affiliated with the University of Göttingen.


ugoe_projects <- cordis_org %>%
  filter(shortName %in% c("UGOE", "UMG-GOE")) %>% 
  select(project_id = projectID, role, project_acronym = projectAcronym)

pubs_projects_ugoe <- pubs_projects %>%
  mutate(ugoe_project = funding_level_0 == "H2020" & project_code %in% ugoe_projects$project_id)
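The tagging logic can be illustrated with a few invented rows. The identifiers and short names below are made up. The sketch keeps project identifiers as character on both sides, because the %in% comparison relies on both columns holding the identifiers in the same textual form.

```r
library(dplyr)

# Toy organisation table in the shape of the CORDIS file
cordis_toy <- tibble::tibble(
  projectID = c("101", "102", "103"),
  shortName = c("UGOE", "OTHER", "UMG-GOE")
)

# Toy publication-project links
pubs_toy <- tibble::tibble(
  id = c("pub1", "pub2", "pub3", "pub4"),
  project_code = c("101", "102", "103", "101"),
  funding_level_0 = c("H2020", "H2020", "H2020", "FP7")
)

# Project identifiers with participation from the institution
ugoe_ids <- cordis_toy %>%
  filter(shortName %in% c("UGOE", "UMG-GOE")) %>%
  pull(projectID)

# Flag links that are H2020 and belong to one of these projects
tagged <- pubs_toy %>%
  mutate(ugoe_project = funding_level_0 == "H2020" & project_code %in% ugoe_ids)
```

The last row shows why the funding programme is checked as well: an identifier that matches by chance in another programme is not flagged.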

Let’s put it all together and benchmark the rates of compliance with the H2020 open access mandate using data from the OpenAIRE Research Graph. The package plotly (Sievert 2018) allows presenting the figure as an interactive chart.


# funding programmes with Uni Göttingen participation
ugoe_funding_programme <- pubs_projects_ugoe %>% 
  filter(ugoe_project == TRUE) %>%
  group_by(funding_level_1, project_code) %>% 
  # min 5 pubs
  summarise(n = n_distinct(id)) %>%
  filter(n >= 5) %>%
  distinct(funding_level_1, project_code)
goe_oa <- oa_monitor_ec %>%
  # min 5 pubs
  filter(all_pub >=5) %>%
  filter(funding_scheme %in% ugoe_funding_programme$funding_level_1) %>%
  mutate(ugoe = project_code %in% ugoe_funding_programme$project_code) %>%
  mutate(`H2020 project` = paste0(project_acronym, " | OA share: ", round(oa_prop * 100, 0), "%"))
# plot as interactive graph using plotly
library(plotly)
p <- ggplot(goe_oa, aes(funding_scheme, oa_prop)) +
  geom_boxplot() +
  geom_jitter(data = filter(goe_oa, ugoe == TRUE),
               aes(label = `H2020 project`),
             colour = "#AF42AE",
             alpha = 0.9,
             size = 3) +
  geom_hline(aes(
    yintercept = mean(oa_prop),
    color = paste0("Mean=", as.character(round(
      mean(oa_prop) * 100, 0
    )), "%")
  ),
  linetype = "dashed",
  size = 1) +
  geom_hline(aes(
    yintercept = median(oa_prop),
    color = paste0("Median=", as.character(round(
      median(oa_prop) * 100, 0
    )), "%")
  ),
  linetype = "dashed",
  size = 1) +
  scale_color_manual(NULL, values = c("orange", "darkred")) +
  scale_y_continuous(labels = scales::percent_format(accuracy = 5L)) +
  labs(x = NULL,
       y = "Open Access Percentage",
       caption = "Data: OpenAIRE Research Graph") +
  theme_minimal(base_family = "Roboto") +
  theme(legend.position = "top",
        legend.justification = "right")
plotly::ggplotly(p, tooltip = c("label"))

Figure 3: Open Access Compliance Rates of Horizon 2020 projects affiliated with the University of Göttingen (purple dots) relative to the overall performance of the funding activity, visualised as box plot. Only projects with at least five publications were considered. Data: OpenAIRE Research Graph (Manghi et al. 2019)

Figure 3 shows that many H2020 projects with University of Göttingen participation have an uptake of open access to grant-supported publications that is above the average in the peer group. At the same time, some perform below expectation. Together, this provides a valuable insight into open access compliance at the university level, especially for research support librarians who are in charge of helping grantees to make their work open access.
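The visual comparison could be complemented by a small table. As a sketch on invented numbers, the following code compares each flagged project with the median open access share of its funding scheme. The median is my choice of benchmark for this example, and the project acronyms are made up.

```r
library(dplyr)

# Toy project-level shares in the shape of goe_oa
goe_toy <- tibble::tibble(
  funding_scheme = c("ERC", "ERC", "ERC", "RIA", "RIA", "RIA"),
  project_acronym = c("P1", "P2", "P3", "P4", "P5", "P6"),
  oa_prop = c(0.90, 0.60, 0.70, 0.50, 0.80, 0.95),
  ugoe = c(TRUE, FALSE, FALSE, FALSE, TRUE, FALSE)
)

benchmark <- goe_toy %>%
  group_by(funding_scheme) %>%
  # peer-group benchmark per funding scheme
  mutate(
    scheme_median = median(oa_prop),
    above_median = oa_prop > scheme_median
  ) %>%
  ungroup() %>%
  # keep only the projects affiliated with the institution
  filter(ugoe)
```

In this toy data, P1 lies above the ERC median of 0.7, while P5 sits exactly at the RIA median of 0.8 and is therefore not counted as above it.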

Discussion and conclusion

Using data from the OpenAIRE Research Graph dumps makes it possible to put the results of a specific data analysis into context. Open access compliance rates of H2020 projects vary. These variations should be considered when reporting compliance rates of specific projects under the same open access mandate.

Although the OpenAIRE Research Graph is a large collection of scholarly data, it is likely that it still does not provide the whole picture. OpenAIRE mainly collects data from open sources. It is still unknown how the OpenAIRE Research Graph compares to well-established toll-access bibliometric data sources like the Web of Science in terms of coverage and data quality.

As a member of the OpenAIRE consortium, SUB Göttingen has made improving the re-use of the OpenAIRE Research Graph dumps a working priority. In the scholarly communication analysts team, we want to support this with a number of data analyses and outreach activities. In doing so, we will add more helpers to the openairegraph R package. It targets data analysts and researchers who wish to conduct their own analysis using the OpenAIRE Research Graph, but are wary of handling its large data dumps.

If you would like to contribute, head over to the package’s source code repository and get started!

Bengtsson, Henrik. 2020a. Future.apply: Apply Function to Elements in Parallel Using Futures. https://CRAN.R-project.org/package=future.apply.

———. 2020b. Future: Unified Parallel and Distributed Processing in R for Everyone. https://CRAN.R-project.org/package=future.

Hester, Jim, Gábor Csárdi, Hadley Wickham, Winston Chang, Martin Morgan, and Dan Tenenbaum. 2019. Remotes: R Package Installation from Remote Repositories, Including ’Github’. https://CRAN.R-project.org/package=remotes.

Larivière, Vincent, and Cassidy R. Sugimoto. 2018. “Do Authors Comply When Funders Enforce Open Access to Research?” Nature 562 (7728): 483–86. https://doi.org/10.1038/d41586-018-07101-w.

Manghi, Paolo, Claudio Atzori, Alessia Bardi, Jochen Schirrwagen, Harry Dimitropoulos, Sandro La Bruzzo, Ioannis Foufoulas, et al. 2019. “OpenAIRE Research Graph Dump.” Zenodo. https://doi.org/10.5281/zenodo.3516918.

Ooms, Jeroen. 2014. “The Jsonlite Package: A Practical and Consistent Mapping Between Json Data and R Objects.” arXiv:1403.2805 [stat.CO]. https://arxiv.org/abs/1403.2805.

Sievert, Carson. 2018. Plotly for R. https://plotly-r.com.

Wickham, Hadley, Mara Averick, Jennifer Bryan, Winston Chang, Lucy D’Agostino McGowan, Romain François, Garrett Grolemund, et al. 2019. “Welcome to the Tidyverse.” Journal of Open Source Software 4 (43): 1686. https://doi.org/10.21105/joss.01686.

Wickham, Hadley, Jim Hester, and Jeroen Ooms. 2019. Xml2: Parse Xml. https://CRAN.R-project.org/package=xml2.